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Abstract 

With recent advances in sensing and tracking technology, trajectory data is becoming increasingly pervasive and 
analysis of trajectory data is becoming exceedingly important. A fundamental problem in analyzing trajectory data 
is that of identifying common patterns between pairs or among groups of trajectories. In this paper, we consider the 
problem of identifying similar portions between a pair of trajectories, each observed as a sequence of points sampled 
from it. 

We present new measures of trajectory similarity — both local and global — between a pair of trajectories to 
distinguish between similar and dissimilar portions. Our model is robust under noise and outliers, it does not make any 
assumptions on the sampling rates on either trajectory, and it works even if they are partially observed. Additionally, 
the model also yields a scalar similarity score which can be used to rank multiple pairs of trajectories according to 
similarity, e.g. in clustering applications. We also present efficient algorithms for computing the similarity under our 
measures; the worst-case running time is quadratic in the number of sample points. 

Finally, we present an extensive experimental study evaluating the effectiveness of our approach on real datasets, 
comparing with it with earlier approaches, and illustrating many issues that arise in trajectory data. Our experiments 
show that our approach is highly accurate in distinguishing similar and dissimilar portions as compared to earlier 
methods even with sparse sampling. 

1 Introduction 

Trajectories are functions from a time domain — an interval on the real line — to with d > 1, and observed as se- 
quences of points sampled from them. They arise in the description of any system that evolves over time and are being 
recorded or inferred from literally hundreds of millions of sensors nowadays, from traffic monitoring systems [26J 
and recordings of GPS sensors on cell phones p4| to cameras in surveillance systems and those embedded in smart 
phones, helmets of soldiers in the field 1 8 1, medical devices, and scientific experiments and simulations such as molec- 
ular dynamic simulations |18J . 

Taking full advantage of these datasets for creating knowledge and improving our decision making requires avail- 
ability of effective algorithms and analysis tools for processing, organizing, and querying these trajectory datasets. 
This step has been lagging behind partly because of huge scale of data, which is constantly growing, but because of 
other challenges as well. Trajectory datasets are marred by sensing uncertainty, and are heterogeneous in their quality, 
format, and temporal support. For instance, GPS measurements are only approximately accurate, and the inaccuracy 
can grow quite large in conditions of poor satellite reception such as urban canyons; many sensors operate on batter- 
ies, preferring to send aggregate or approximate measurements to conserve energy (leading to nonuniform and sparse 
sampling); trajectories retrieved from video typically arrive in bits and pieces. At the same time, individual trajectories 
can have complex shapes, and even small nuances lead to big differences in their understanding. 

A central problem in analyzing trajectory data is to measure similarity between a pair of trajectories and to identify 
portions that are common between the two trajectories. Besides being interesting in its own right, this problem often 
lies at the core of classifying, clustering, and computing mean trajectories. In some applications (e.g. handwriting 
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analysis, speech recognition), one of the trajectories is the ground truth and the other is an observed trajectory, while 
in some other applications (e.g. traffic analysis, video surveillance), both trajectories are observed data. 

Problem statement. Let P = (pi, . . . ,Pm) and Q = (g'l, . . . , ^n) be two sequences of points in R^, sampled from 
two trajectories 71 and 72 defined over the time interval [0, 1] with added noise. We assume that the trajectories are 
defined over the same time interval only for the simplicity of presentation; our models and algorithms work even when 
they are defined over different time intervals. We also assume that P and Q are points sampled from the images of 
the trajectories and, for simplicity, we ignore the temporal component again, our models and algorithms work even 
if we take the temporal component into account. 

Ideally, if we knew the two underlying trajectories 71, 72, then their similar portions can be defined by reparame- 
terizing them using two functions a : [0,1] ^ and /3 : [0,1] ^ R^ and identifying the sub-intervals of [0, 1] over 
which the two of them are similar. However, we only have finite sample points from each trajectory with additional 
noise, so we identify similar portions between the two by computing correspondences between sample points in one 
trajectory to points in the other, and vice-versa. Such a set of correspondences must satisfy the following criteria: 

(i) Similar portions of the trajectories must be identified even if the sampling rates are different. For example, 
common routes taken by moving objects should be identified even if GPS coordinates are obtained at highly 
different times. 

(ii) It must take into account noise/outliers, i.e., it must properly discriminate between noise and actual deviations 
of the two trajectories. 

(iii) It should work even if the trajectories are partially observed, i.e., some portion of a trajectory is missing. 

For any given set of correspondences between two trajectories, it is also advantageous to provide a score or measure 
indicating their degree of similarity. Such a score is not only useful in identifying a good set of correspondences but 
also as a way to rank multiple pairs of trajectories according to similarity, e.g. in clustering applications. 

The objective is to define an appropriate model for correspondences together with an appropriate scoring function 
to capture the above requirements and given two trajectories, compute correspondences with the optimal score efficient. 
In addition, we are also interested in identifying most similar contiguous sub-trajectories of two given trajectories. 
Thus, another objective is to compute local similarity between trajectories. 

Related work. Motivated by a wide range of applications, there is extensive work 
on computing similarity between two objects (or rather points sampled from the ob- 
jects) in many fields, including computational geometry, computer vision, computer 
graphics, and statistical learning. Two of the commonly used methods for measuring 
similarity are Hausdorff distance |3 | and earth-movers distance p2| , and their vari- 
ants. These methods are, however, not suitable for measuring similarity between two 
trajectories because they only focus on the location of points and not their ordering 
along the trajectories. A better choice is the so-called Frechet distance |2 | (see Fig.[T]). 
Informally, consider a person and a dog connected by a leash, each walking along a 
curve from its starting point to its end point. Both are allowed to control their speed, 
but they cannot backtrack. The Frechet distance between the two curves is the min- 
imal length of a leash that is sufficient for traversing both curves in this manner. A 
reparameterization is a continuous non-decreasing surjection a : [0, 1] [0, 1], such that a(0) 
The Frechet distance Fr(7i, 72) between two trajectories 71 and 72 is then defined as follows: 




Figure 1. A pair of trajectories with 
small Hausdorff distance but large 
Frechet distance. 



and a(l) = 1. 



Fr(7i,72) = inf rnax ||7i(a(t)) - 72(/3(t))||, 

a,/3tG[0,l] 



where || • || is the underlying norm (typically, as above, the Euclidean norm), and a and /3 are reparameterizations of 
[0, 1]. The Frechet distance between two polygonal curves with m and n vertices respectively can be computed in 
0{mn log(m + n)) time |[2|. 



^Strictly speaking, a trajectory is the graph of the underlying function, and what we have are the curves traced by the two trajectories, but we 
will not distinguish between the two. 
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Figure 2. Computing similarity between two trajectories using different distance measures: (a) Frechet distance, (b) dynamic time warping or 
average Frechet distance, (c) sequence alignment based method, (d) our model. 

If we only have samples of points on each trajectory, then a simpler variant, called discrete Frechet distance, is 
used, where the dog and its owner are replaced by a pair of frogs that can only reside on n and m specific stones P 
and Q on 71 and 72, respectively. These frogs hop from one stone to the next without backtracking, and the discrete 
Frechet distance is the minimum length of a "leash" that connect the frogs and allows them to execute such a sequence 
of hops. The discrete Frechet distance can be computed in 0{mn) time by a straightforward quadratic programming 
algorithm. Recently Agarwal et ah 1 1 1 have presented a sub-quadratic algorithm for this problem. 

Although the Frechet distance is often used as a measure of curve matching, in its traditional definition, an assign- 
ment yielding the Frechet distance is not necessarily a good indicator of the correspondences between the trajectories. 
The reason for this is because there could possibly exist a large number of correspondences which yield the optimal 
Frechet distance since the optimization criteria is the distance between the two farthest points in the coupling; see 
Fig.|2ja). Thus, in order to identify a good correspondence between the two trajectories, we must resort to the average 
Frechet distance, more commonly known as dynamic time warping distance, where we use the analogy of a pair of 
frogs hopping along P and Q, but the goal is to minimize the "average length" of the leash instead of its maximum 
length. Dynamic time warping (DTW) was originally developed for matching speech signals in speech recognition | [2T| 
and computes good correspondences between the two trajectories if they are similar for the most part, as is the case 
for trajectories corresponding to pairs of speech signals for a word or handwriting trajectories for a particular words 
or letters. 

However, if the trajectories contain significant dissimilar portions, possibly due to actual deviations (such as dif- 
ferent routes for GPS trajectories), the results are not as meaningful. Consider Fig.[2jb) which shows two trajectories 
starting with similar paths but where, one trajectory deviates from the other for a significant portion and then rejoins 
the original path. DTW tries to find a correspondence for all points and thus, gives correspondences for points in 
the deviation for which no meaningful one exists. Moreover, the actual measure is skewed by these portions and it 
is difficult to distinguish them from the similar ones using the obtained correspondences. It is even more difficult to 
distinguish actual deviations from outliers caused by measurement errors. 

There is a rich body of literature on pairwise sequence alignment in computational biology, where the goal is to 
identify similar portions between two DNA or protein sequences. Given two sequences A and B, their alignment is 
expressed by writing them in two rows respectively such that at each position in the first (resp. second) row, there is 
either a character in A (resp. B) or a blank character (termed as a gap character). Characters in the same column are 
deemed either identical or similar and characters in a sequence aligned to gap characters do not have a correspondence 
in the other sequence. A maximal contiguous sequence of gap characters is termed a gap, analogous to our notion 
of gaps; see Def. [T] The goal is to compute such an alignment which optimizes a similarity scoring function which 
assigns a score for aligning two characters and a penalty for gaps. The score for an alignment of two characters is an 
incentive if they are deemed similar and is a penalty if not. See fTO', cf. Chapter 2] for further details. 

In computational biology, the algorithms for sequence alignments have been extended to aligning two polygonal 
curves such as protein backbones. We may easily extend this model to the alignment of trajectories with the choice 
of an appropriate scoring function. We describe such a scoring function in Sec. |2] However, as Fig. |2jc) shows, 
non-uniform sampling rates cause similar portions to be designated as gaps since correspondences are restricted to be 
one-to-one. 

Various approaches of defining distance in order to distinguish similar and dissimilar portions exist f9l which 
aim to combine the advantages of DTW as well as sequence alignment. These include the longest common sub- 
sequence | [25| , an adaptation of the edit-distance measure | 7 | for sequences where trajectory points are designated a 
match if they are closer than a threshold distance and different otherwise, and the Edit-Distance with real penalties 
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measure jSj. The latter tries to incorporate the distances by evaluating dissimilarity of points by their distance to 
a fixed reference point. However, it is not clear how this point should be chosen and why it is a good indicator of 
dissimilarity. A drawback of all these measures is that they do not distinguish between noise and actual dissimilarity. 
In fact, in the application scenarios under consideration in these works, noise is the primary reason for the dissimilarity 
since the underlying curves are identical unlike the general setting. 

Regarding the problem of computing partial or local similarity of curves, Buchin et al. |4 | consider the problem 
of sub-trajectory similarity under a continuous version of the average Frechet distance where a user-defined parameter 
specifies the minimum sub-trajectory duration. Further, in |5|, the authors show how to compute a "partial" Frechet 
distance, i.e., a maximal portion of the trajectory curves which are close enough. However, their algorithm does not 
work under the Euclidean norm. 

Apart from spatial similarity of trajectories, various other patterns have also been considered. In |[T3][T6|, the 
authors consider various formalizations of interesting patterns such as flock, where a set of trajectories follow each 
other for a long duration and convergence where a set of trajectories converge at a particular location. In |T4l[27| , 
similarity of soccer player trajectories is considered based on significant events such as passing or shooting. 

Finally, as mentioned above, computation of similarity is often a requirement for clustering trajectories. We refer 
to |THP^[2Q| for a few such approaches. 

Our contributions. Our first contribution is the introduction of a new model for correspondences as defined using 
the notion of assignments in Sec. |2] Together with this model, we design an appropriate way to score assignments 
which can be used to rank pairs of trajectories according to similarity. The model and scoring function provides a 
unified framework which encompasses all previous approaches for computing trajectory similarity such as dynamic 
time warping, sequence alignment, edit-distance and others while satisfying the requirements presented previously in 
the problem statement. Our model may also be extended to spatio-temporal data by treating time as an extra dimension. 
For conciseness, we ignore the temporal component in the rest of the paper. 

Next, in Sec. |3] we describe a quadratic time dynamic programming based algorithm for computing optimal as- 
signments between two trajectories P and Q under our scoring function. Our algorithm combines ideas from sequence 
alignment and dynamic time warping and matches the asymptotic running time complexity of these approaches. 

In Sec. [4] we show that our model can also be used for performing the so-called local assignment for identifying 
most similar sub-trajectories between two trajectories P and Q. This is done by adapting the model slightly; the 
algorithm remains the same. Computing local assignments is similar to the notion of local sequence alignment in 
computational biology. 

If the points on P and Q are sampled non-uniformly and at different rates, then mapping each point of P to a point 
of Q may lead to unnatural alignments. We therefore consider a semi-continuous model in which we interpolate each 
trajectory between consecutive sampled points, say, by a linear function, and allow a point of P or Q to be assigned 
to an interpolated point of the other trajectory (cf. Sec. [5]). This extension leads to considerably better assignments in 
cases where sampling rate differences are significant while measurement noise is less significant. 

Finally, we show the effectiveness of our model (in Sec. |7]) by evaluating it on a number of real datasets and 
comparing it with sequence alignment, dynamic time warping, and a modified dynamic time warping heuristic to 
distinguish similar and deviating portions between pairs of trajectories. In practice, our model captures the advantages 
of both dynamic time warping and sequence-alignment based approaches with none of their drawbacks. 

2 Model 

Let P and Q be two sequences of points as defined above. We describe our model for measuring similarity between 
P and Q and finding common portions of them. Our model builds on the strengths of both dynamic time warping 
(DTW) and sequence alignment. Recall that 

• DTW does not handle deviations well because it tries to match every point on P and Q, but it can handle non- 
uniform sampling of points by allowing multiple points of P to match with one point of Q, and vice versa; see 
Fig.igb). 
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• The sequence alignment model identifies deviations well while distinguishing them from noise but does not 
allow multiple points of P to match with one point of Q, and hence, has issues with non-uniform sampling; see 

Fig.igc). 

The following notion of assignments models correspondences between P and Q which captures the advantages of 
the above approaches without their drawbacks. Together with an appropriate scoring function defined later, assign- 
ments accurately reflects the degree of similarity of one trajectory to the other by satisfying the requirements posed in 
the problem statement in Sec. [T] 

Definition 1 An assignment /6>r P and Q is a pair of functions a : P ^ Q U {±} and (3 : Q ^ P U {±} for 

the points of P and Q respectively. If a{pi) = ± (or ^{qj) — \_), then pi {or qj) is called a gap point. A maximal 
contiguous sequence of gap points in P or Q is called a gap. 

An assignment is monotone if it satisfies the following conditions: (i) ifa{pi) = qj then for all i' > i, a{pif) = ± 
or a{pi') = qj' for some j' > j, and (ii) if j3{qj) = pi then for all j' > j, P{qj') = ^ or P{qj') = Pi' for some i' > i. 

Intuitively, if a point pi e P lies on a similar portion of the two trajectories then a{pi) defines the point on Q 
to which Pi corresponds, and Pi is a gap point otherwise. A similar interpretation holds for Unlike traditional 
alignment/matching models, our assignments are asymmetric which allows it to better adapt to trajectories with dif- 
ferent sampling rates. The notion of gaps is introduced to identify deviations between the two trajectories. Using gaps 
enables the identification of a "good" assignment even if there are only partial observations on any of the trajectories; 
we can compute an assignment for the observed portions. Distinguishing between noise and dissimilarity can be ac- 
complished by restricting to assignments where gaps are sufficiently long (short deviations are more likely to be due 
to noise and thus, the underlying portions of the trajectories are similar). 

It will be easier to view an assignment of P and Q in terms of the complete directed bipartite graph G := 
G{P^Q) = {PUQ^PxQuQxP), i.e., G has a directed edge {pi^qj) and another directed edge (qj^Pi) for 
every pair G P and qj e Q. We say that a pair of edges (p^, qj) (or (qj^Pi)) and {pk, qi) (or {quPk)) cross ifi <k 
and j > I or vice versa. Under our definition, the opposite edges (p^, qj) and (qj^Pi) do not cross each otherj^ 

Hence, in this perspective, a monotone assignment is a set E of pairwise non-crossing edges in G so that E has at 
most one outgoing edge from every point in P U Q. Points with no outgoing edges are gap points and, as in Def.[T} 
a maximal contiguous sequence of gap points in P or Q is called a gap, and the length of the sequence is called the 
length of the gap. Let r(^) denote the set of gaps in P and Q for the assignment E. We define the score of E, denoted 
bycr(P,Q;£;),as 

a{P,Q;E)= ^ ----^ -+ ^ a + A|^|, (1) 

^-^ c + m — v\r ^-^ 

where a, A and c are parameters which are chosen carefully, as described in Sec. |6] || • || is the L2-norm and \g\ 
is the length of a gap g. For appropriate comparison between different pairs of trajectories and their corresponding 
assignments, the above score may be normalized in a straightforward manner to provide a score between and 1. We 
can rewrite the score in terms of the functions a, (3 from Def.[T]as well: 

a(P,Q;a,/?)= E e+|b.-a(,OP 

+ E e+ II,, -/?(,,) P + ^ (« + A.|,|), (2) 

where r(a, /3) is the set of gaps in P and Q for the assignment a, /3. 
We define the similarity between P and Q as 

a(P,Q) = max cr(P,Q; a, 



^Note that our definition of the crossing is topological, defined in terms of the graph G. the line segments piqj and p^qi corresponding to two 
non-crossing edges (p^, qj) and (p^, qi) may cross each other (geometrically), and may be disjoint even if the graph edges are crossing. 
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The assignment a*, /3* = arg maxQ,^ cr(P, Q; ol-, P) identifies the similar portions of P and Q. 

Why the directed graph? We now explain why we chose a directed graph model and an assignment. Both DTW 
and sequence alignment can be formulated as computing a subset of non-crossing edges in the complete undirected 
bipartite graph P x Q. In particular, DTW finds a subset D C P x Q of non-crossing edges such that each vertex of 
P U Q is incident on at least one edge of D and the total "length" of edges in D is minimum. In some applications, an 
appropriate choice for the length of edges would be the Euclidean distance whereas in other applications, a different 
function is used. On the other hand, the sequence- alignment model asks for computing a matching M C P x Q whose 
score is maximum, where the score of a matching is similar to ([T]). 

It is tempting to describe an assignment as well with respect to the undirected graph but this leads to difficulties. For 
example, we may relax the condition of finding a "matching" from the sequence- alignment model allowing multiple 
points of P to match with one point of Q, (see Fig.[3ja)) and simply ask for finding a set of non-crossing edges whose 
score is maximum, but this is not always meaningful and introduces additional edges that are redundant. For example, 
in Fig.jsj^b), this approach will find three edges — the diagonal edge (p2, ^i) is spurious and is an artifact of the model 
because it is more meaningful to match pi with qi , p2 with q2 and vice versa. 

The directed graph model avoids this problem by not requiring the functions a and (3 to be symmetric; see Fig.|4] 



Pi P2 Ps P4 Pi P2 




Figure 3. Pros and cons of allowing multiple matches in an undi- 
rected graph:(a) A situation where allowing multiple matches is Figure 4. Using a directed graph provides a logical set of corre- 
logical. (b) A situation where allowing multiple matches does not spondences always. Comparison with the examples of Fig.[3] 
allow us to obtain a clear correspondence. 



There are a few other subtle advantages of using the directed graph model, omitted from this abstract due to lack 
of space. 

Remark, (i) Our framework is not limited to the scoring function ([T]). For example, the sequence-alignment based 
approach, DTW or other measures such as edit-distance are easily incorporated into our model, (ii) Note that according 
to our definition of assignments, vertices in G which have incoming edges but no outgoing edges are designated as 
gap points. This may be easily modified so gap points have no incoming edges. 



3 Algorithms 

We now describe an algorithm for computing the optimal score cr(P, Q) and the corresponding assignment. Our 
algorithm is similar to that for sequence alignment 1 19] and runs in 0{mn) time. The main difference is that, because 
of the asymmetric nature of the definition of assignments, the recurrence relation is more complex and we need to 
compute a few auxiliary assignments and score functions. 

Auxiliary functions. For 1 < i < m, let P^ = (pi, . . . denote the prefix of P of length i, and for 1 < j < n, 
let Qj = (gi, . . . , Qj) denote the prefix of Q of length j. The algorithm computes the similarity score cr(P^, Qj) for 
all 1 < z < m and I < j < n. For brevity, we will use to denote a{Pi,Qj). Let ^{i^j) denote the (optimal) 

monotone assignment corresponding to the score cr{i^j). With a slight abuse of notation, we will use both the graph 
representation — a set of non-crossing edges — as well as the pair of functions a, /3. It will be clear from the context 
which of the two representations we are referring to. 

For each pair j, to compute cr{i^j) and ^{i^j) efficiently, we also compute a set of auxiliary functions described 
below: 
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a{ij) = max|crx,*(^, j)^cr*,^(z, j),cr0,*(i, j) j),cr*,0(i, j) j)} (3) 

(^±,^{hj) = max|cr(i - + a + A,a^,*(i - 1, j) + A, j) -^S{iJ), 

(4) 

(^ci^A^^j) = max|cr0,^(i - 1, j),cr*,0(z - 1, j) j),cr0,*(z, j - 1) j)^ 

(5) 

(T(f,^±{iJ) = max|cr0,*(z, j - 1) a ^ A, a^^±{ij - 1) + A,cr*,x(^ - 1, j)}, (6) 
j) = max|cr*,^(i - 1, j) + a + A,crx,±(^ - 1, j) + A,cr^,4z, j - 1) + a + A, 

ax,±(^,j-l) + A}. (7) 

Figure 5. Recurrence relations for a and each of the auxiliary functions. The relations for cr*,^, cr^ ^ are symmetric to those above for cr^^* and 
(^ct>,± respectively. Here, S(i,j) = l/(c + - 

• and cr^^±{i^j)\ cr±^^{i^j) denotes the score of the best monotone assignment, which is denoted by 

j), for Pi and Qj with the restriction that pi is a gap point. That is, there is no outgoing edge from pi, 
but our model allows T>±^^{i^j) to have incoming edges to Pi. We similarly define cr^^±{i^j) and E*^^ (i, j). 



j)'- the score of the best monotone assignment, denoted by {i-, j), for Pi and Qj with the restriction 
that both Pi and are gap points. 

• j) and a^^(f){i,j)\ (J(f)^^{i^j) is the score of the best monotone assignment for Pi and Qj, denoted by 
^ci),^{hj)^ with the restriction that pi is not assigned to any point of Qj but points of Qj can be assigned to 
Pi, i.e., there is no outgoing edge from pi but there can be incoming edges to Pi. The difference between 

j) and ax,* j) is that pi is considered as a gap point in the latter and cr±^^{i^j) includes the gap score 
corresponding to Pi, namely a + A if a new gap starts at Pi in 12 j) and A otherwise, while in the former 
no score is added corresponding to Pi. We define (J^^(f){i^j) and 12^^(f){i,j) analogously. 

• o-±^(f){i^j) and cr(f)^±{i, j)\ (J±^(f){i^j) is the score of the best monotone assignment for Pi and Qj, denoted by 

with the restriction that Pi is not assigned to any point of Qj as in the previous case, and Qj is a gap 
point. Note that there are no outgoing edges from pi or qj but there may be incoming edges to one or both of 
them, and that cr±^(f){i^j) does not include any score corresponding to pi but does include a gap score for qj. We 
define (J(f)^^{i^j) and S^^^l^, j) analogously. 

Recurrences. We now describe recurrence relations for each of the auxiliary score functions just described, so that 
each value can be computed in 0(1) time using dynamic programming. Fig. [S] describes the recurrence relations. For 
brevity, we set (5 (i, j) = l/(c + \\pi — qj\\^). Because of lack of space, we derive the recurrences only for cr{i,j) and 
j); others can be derived in a similar manner. Also, the corresponding assignments can be computed using the 
standard backtracking method 1 19 |. 

Recurrence for cr{i^j). Consider the optimal assignment for Pi and Qj. There are three possibilities: (i) pi is 

a gap point, (ii) qj is a gap point, and (iii) there are outgoing edges from both pi and qj. In cases (i) and (ii), ^{i^j) 
is j) and respectively, by definition. So consider case (iii). Suppose ^{i^j) has (directed) edges 

{PiiQj') and {qj^Pi')- Since the edges in T,{i^j) are non-crossing, / < j implies that i' = i and similarly i' < i 
implies that / = j. Hence, contains at least one of (p^, qj) and {qj^Pi). In the former case, T>{i^j)\{{pi^qj)} 

is nothing but cFd^^^ii^ j), and in the latter case, ^{i^j) \ {{qj^Pi}} is i;*,0(i, j). Thus, cr{i^j) satisfies recurrence (jsl. 
Recurrence for cr±^^{i^j). There are three cases for the assignment T>±^^{i^j): (i) qj is a gap point, (ii) there is an 
incoming edge to pi, (iii) there is no incoming edge to pi. In case (i), T>±^^{i^j) = T>±^±{ijj), since both pi and qj 
are gap points. So assume qj has an outgoing edge in j). 
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If there is an incoming edge to pi, then, by the non-crossing property, T>±^^{i^j) contains the edge {qj^Pi) and 
thus, we have j) \ {{qj^Pi)} = j). On the other hand, if j) does not have an incoming edge 

to Pi, then j) is the same as — 1, j) or — 1, j) depending on whether a new gap starts at pi in 

j) or a gap containing pi starts earher. Putting everything together, ^ gives the recurrence for cr^^*(z, j). 

We maintain a separate table for each auxiliary function and compute the entries in increasing order of i and j. 
For a fixed pair j, we compute them in the following order: crx,±, crx,^, cr(f),*,cr^^(j), cr^,*, cr*,x, cr- It can be 
verified from ([3])-(|7]) that each of them can be computed in 0(1) time. Hence, the total time spent in computing the 
final cr(P, Q) and S(P, Q) is 0{mn). If we maintain the entire tables, the space used is also 0{mn) but it can be 
reduced to 0(m + n) 1 15 |. We thus obtain the following: 

Theorem 3.1 Given two sequences of points P and Q in of lengths m and n respectively, the similarity score 
a{P, Q) and the corresponding assignment can be computed in 0{mn) time using 0{m-\-n) space. 



4 Local Assignment 

We now describe how we modify our model and algorithm, described in Sections [2] and |3j for finding the most similar 
sub-trajectories between two trajectories, or the so-called local assignments. Intuitively, during the course of the 
execution of the algorithm from Sec.|3] when we find that the score of aligning initial portions of the trajectories is too 
small, we should discard them from further consideration and start afresh. However, the score of assigning a point pi 
to qj, or vice versa, is always positive. So, under this model, there is no incentive for starting afresh. We therefore, 
modify the score of an assignment E, represented as a set of non-crossing edges, as follows: 

aiP,Q;E)= ^ _ + ^ (a + {A - r)\g\) . (8) 



{u,v)eE 



geriE) 



Here, r is a threshold parameter that we subtract from each term in ([T]). The value of r indicates how low we would 
like the score to go before we decide that it is too low for local similarity and choose to start afresh. As earlier, our 
goal is to compute the maximum score and the corresponding assignment: 

cr(P,Q) = maxcr(P,Q;£;); S(P,Q) = arg maxcr(P, Q; £;), 

E E 

where the maximum is taken over all monotone assignments. 

Following the same ideas as in the algorithm for local sequence alignment | [23| , we can compute (j(P, Q) in 0{mn) 
time. More precisely, we modify the recurrences in ([3])-(|7]). For example, 

cr(z, j) = max{cr^,*(i, j),cr*,x(^, j),cr0,*(^, j) + ^(^, j), cr*,0(i, j) + j), O}. 

The others are modified accordingly, by adding the term. Finally, instead of returning cr(m, n), we return the score 
max^^j cr{i^j). Omitting details, we conclude the following: 

Theorem 4.1 Given two point sequences P and Q of lengths m and n respectively, an optimal local assignment for 
them can be computed in 0(mn) time. 



5 Semi-Continuous Model 

As discussed above, the (asymmetric) assignment model is robust to sampling rates since it tries to assign each point in 
one trajectory to a similar point in the other trajectory if one exists. In many applications, where data is being acquired 
by sensors on a network and energy or communication is expensive, the sampling may be very sparse at some portions. 
If we assume that P and Q are points sampled from a continuous process, then we may wish to model this continuous 
process and assign each point of P (or Q) to a point on this continuous trajectory which is not necessarily one of the 
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(a) (b) 

Figure 6. Discrete (a) vs semi-continuous (b). 



input points. For example, consider Fig.|6] Due to the different sampling rates of the trajectories, it is better to assign 
endpoints in one trajectories to points in between endpoints of the other trajectory. 

A commonly used model for trajectories is to connect two consecutive sampled points by a line segment — 
resulting in a polygonal curve. Let P and Q denote these curves for P and Q respectively. Ideally, we would like to 
define a monotone assignment as a pair of functions a : P ^ Q [J {!-} and /3:Q^PU{±} and the goal will be 
to compute a monotone assignment with the maximum score, as defined in ([2]). This is however, very hard to compute 
because of the algebraic complexity of the assignment; the description of the points on P and Q to which Q and P are 
mapped can be quite large. Because of lack of space, we omit a lower-bound construction from this paper. 

We circumvent this problem by sampling points on each edge of P and Q — for each point pi e P and every 
edge QjQj-^-i of Q, we sample the closest point of Pi, = arg minxe qjqj+i \\Pi ~ QjOj+i- We do the same for 

every point qu ^ Q and every edge PiPi-\-i of P. Let P* and Q* be the resulting sequence of points. We can then run 
the previous algorithm on P* and Q*. Since |P*|, |(5*| < mn, the total running time is 0{rin?n^). Besides the high 
running time, the sampling step is somewhat ad-hoc. 

To improve efficiency and avoid this large sampling step, we propose a method of "cheating" to assign points in 
one trajectory to points in between endpoints in the other trajectory if no endpoint is a good match. At each step of the 
dynamic programming algorithm, we examine prefixes Pi and Qj. At this step, we may choose to assign a{pi) = qj 
or I3{qj) = Pi depending on their contribution to the overall score which is based on the distance \\pi — qj\\. We modify 
this slightly by assigning a{pi) = qj-ij where qj-ij is the closest point on the segment preceding qj. Similarly, we 
assign = 

Although the assignment is now no longer restricted to be monotone, in practice, we expect that in the similar 
portions of the trajectories, such an assignment would be monotone and further, it would capture the correspondences 
better than the discrete setting. Moreover, the running time is now 0{mn) which matches the efficiency of the discrete 
algorithms. Fig. [6] shows a comparison of the two settings for a toy example. This example is very similar to that 
of Fig. [2] but with different sample points. Note that the semi-continuous model obtains a much more regular set of 
correspondences of the positions along the trajectories than the discrete setting. 



6 Parameter Selection 

Parameter selection is an important issue when choosing the scoring function. During the course of the algorithm, 
when examining a pair of points Pi ^ P and qj G Q, the difference in values l/(c + \\pi — qjW^ versus A dictate the 
choice of whether to assign a{pi) = qj or Piqj) = Pi versus assigning one or both as gap points. 

We work with the hypothesis that all "matched" points have roughly the same distance. Let the threshold on the 
distance beyond which points are dissimilar be r. We suggest the following choices of the parameters: 

r 1 
c=-, A = a = -/A, 

2 c + 

where / indicates a minimum gap length. If / = 0, the algorithm simply chooses points which are farther apart than r 
to be gap points. Otherwise, it only chooses to start gaps when at least / points are farther than r apart since, if not, 
a + A • 1^1 will be negative. 
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The choice of r is clear if we have semantic information about the trajectories such as what type of entities 
generated them. For example, if we have GPS trajectories from road networks and wish to classify similar portions as 
those following the same road, then the width of the road and the sampling rate or GPS accuracy would determine the 
choice of r. 

On the other hand, in many situations, the choice of r is not clear. For example, if we are given road trajectories 
but the roads may consist of both highways as well as side streets or alleyways, the widths may vary widely. In such 
cases, we wish to infer the value of r from the dataset. Unfortunately, this is a hard inference problem. One could use 
a Bayesian method for this problem, but this is beyond the scope of this paper. 

Instead, we use a simple iterative algorithm which proceeds according to the following steps: 

(1) Start with a rough guess of the upper bound r and compute an assignment with r = r. 

(2) Discard a percentage of the larger distances in this assignment and compute the rms of the remaining distances. 

(3) Choose a new threshold which is a small factor of the rms, say r = ci • rms. 

(4) Repeat steps (2) and (3) until the assignments converge. 

When we identify all similar portions and dissimilar portions, we do not expect the rms to change significantly 
when we discard some of the larger distances. Hence, the assignments should converge and the algorithm should 
terminate. 

We may set different criteria for termination, particularly if we have additional information. For example, if 
we have a lower bound on the threshold, we can terminate when we reach this value, since the algorithm need not 
necessarily terminate at this point. If we know the characteristics of the noise on the measurements, for example, if it 
is Gaussian, we can fit a Gaussian to the remaining distances and converge when it is a good enough fit. 

7 Experiments 

We have conducted an experimental study on real datasets to evaluate the effectiveness of our model and to compare 
with previous approaches. We used the DTW and sequence-alignment based approaches to provide a basis for com- 
parison of our approach. We refer to these algorithms as DTW and Seq- ALIGN and to our algorithm from Sec.|3]as 
Assignment throughout this section. 

Since DTW tries to find correspondences for every point on a trajectory and consequently, does not aim to dis- 
tinguish between similar and deviating portions, we applied a simple heuristic to make the comparison with ASSIGN- 
MENT and Seq-Align fairer: after computing the DTW correspondences, we pruned all correspondences where the 
distance between the two constituent points was greater than a threshold. Recall that the parameter selection for our 
scoring function (cf. Sec. [6]) uses a similar distance threshold r. We call this modified DTW heuristic DTW-Pruned. 

We present two types of results in this section: (i) an analysis of the nuances of ASSIGNMENT, DTW and Seq- 
Align by examining individual pairs of trajectories and their resulting correspondences, and (ii) characteristics of the 
data based on the correspondences computed over all pairs of trajectories in the datasets. The goal of our experiments 
was not only to present a qualitative study but also to observe the characteristics of the data and how they impact the 
model parameters. 

7.1 Description of Datasets. 

We have used three real datasets in our experiments: (i) the Geolife project by Microsoft Research Asia p8f[3Q| 
consisting of 17,621 trajectories of 182 users in China, (ii) 145 trajectories of school buses in Athens, Greece 1 12], and 
(iii) 38 trajectories from road cycling exercises captured by a fitness GPS device at a constant one second sample rate. 
Many of the trajectories in the GeoLife dataset are labeled with the mode of transportation used from the set {biking, 
walking, running, bus, car, taxi, train, subway, airplane}. We extracted the trajectories with labels in {biking, walking, 
running} and extracted a sample of trajectories for our experiments. Our subset consists of a total of 883 trajectories. 
Fig.jTJa) shows the portion of this dataset where trajectories have the highest concentration which is, not surprisingly, 
in Beijing. Fig.jTJb) andjTJc) shows the other two datasets. 
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(a) (b) (c) 

Figure 7. Datasets: (a) GeoLife datasets of trajectories with transportation modes in {biking, walking, running} respectively, (b) Buses dataset, (c) 
trajectories from road cycling exercises recorded with a fitness GPS device. Sample points are blue while the edges between two points are red. 



Although all three datasets are GPS trajectories, they differ significantly from each other due to the different modes 
of transportation and sensor equipment as well as due to road network characteristicsin the buses dataset, the sampling 
rate is fairly infrequent and the data also contains significant measurement noise. On the other hand, although the 
sampling rate is more frequent for the biking, walking and running trajectories in the GeoLife dataset, here too, it 
contains significant measurement noise. In addition, there seem to be a large number of partially observed trajectories 
(long stretches with no samples). The dataset of cycling exercise routes collected is highly accurate and extremely 
densely sampled making it a good representation of ideal conditions. 



7.2 Results on Pairs of Trajectories 

To show the effectiveness of our algorithms, we present results on a single pair of trajectories from each of the datasets. 
We chose the pairs in such a manner that they exhibit significant similar portions as well as dissimilar portions and have 
sufficiently dissimilar sampling rates in the case of the pairs from the Bus and GeoLife datasets (the cycling exercise 
trajectories all have a uniform sampling rate). Consequently, the results exhibited on these pairs are representative 
of the results on other trajectories as well. In all cases, we perturbed one of the trajectories slightly in the figures to 
present the results more clearly. 

Global Assignment. In each case, we compare the results of Assignment to the results of DTW, DTW-Pruned 
as well as Seq-Align. We chose a distance threshold r = 100 m for our parameter selection process for Assign- 
ment (see Sec. [6]) as well as for DTW-Pruned. The minimum gap length was set at 4 for both Seq-Align and 



Assignment. Figures [8] [9] and 10 show the results of computing assignments on the pair of trajectories chosen from 
each dataset. In the former two, we have also shown in an inset in some cases: a zoomed in portion of the trajectories 
where the issues with the different approaches are clearer. 

The correspondences computed by DTW are clearly not meaningful in differentiating deviating and similar por- 
tions (see Figures [8ja), |9ja) and[TOja)). On the other hand, for the buses and GeoLife data, DTW-Pruned clearly 
does much better in identifying deviations (see Figures [8jb) and|9jb)) but generates a number of smaller gaps due to 
some correspondences which may be desired having larger distance than the threshold possibly due to measurement 
noise. Seq- ALIGN which does impose a minimum gap length to get around this issue also has a similar behavior. 
Here, however, this is caused by the restriction to one-to-one correspondences. Finally, ASSIGNMENT captures the 
advantages of the other approaches by generating correspondences for almost all points on the similar portions (see 
Figure [8jd) and[9jd)). In this case, the imposition of a minimum gap penalty generates correspondences which may 
be desired in spite of having distance larger than the threshold due to the surrounding correspondences. Due to the 
allowance of multiple incoming edges for a point in ASSIGNMENT, the sampling rate issues are also handled well. In 



the exercise cycling dataset (see Fig. 10), all approaches perform well due to the uniform sampling rate, velocity and 
accuracy of sampling. 

We also present histograms of the distances of all corresponding pairs for DTW and ASSIGNMENT for the buses 



and cycling datasets in Fig. 1 1 The x-axis is set at a log-scale to better visualize the histograms. The bins containing 



the mean and rms values are highlighted in yellow and red respectively with no red bin indicating that the rms and 
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(a) DTW 



(b) DTW-Pruned 





(c) Seq-Align (d) Assignment 

Figure 8. Results on a trajectory pair from the buses dataset. 





(a) DTW 



(b) DTW-Pruned 




(c) Seq-Align 



(d) Assignment 



Figure 9. Results on a trajectory pair from the GeoLife dataset. 




(c) Seq-Align (d) Assignment 

Figure 10. Results on a trajectory pair from the cycling dataset. 
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Distance of Correspondences Distance of Correspondences 

(c) Cycling dataset, DTW (d) Cycling dataset, Assignment 

Figure 11. These histograms summarize the distance between corresponding pairs in the results from Figures |8] and Qo| (a) and (b) correspond 
to FiguresjSja) andjsjd) respectively while (c) and (d) correspond to Figures [Tota) andQofd) respectively. The bins containing the mean and rms 
values are highlighted in yellow and red respectively with no red bin indicating that the rms and mean lie in the same bin. The vertical blue line 
denotes the threshold distance. 



mean lie in the same bin. Comparing Figures[TTJa) and[TTJb), we notice that the assignment retains some edges whose 
distance is larger than the threshold r (shown as a blue dashed line). This is due to the minimum gap length constraint 
which is necessary due to noise in the data. As mentioned above, including correspondences nearby affect the chance 
of including a correspondence in the result which is more meaningful than simply pruning based on distance as in 
DTW-Pruned. 

Local and Semi- Continuous Assignment. We present qualitative results for local and semi-continuous assignments. 



Fig. [12] compares the discrete and semi-continuous assignments for the trajectories chosen from the bus dataset. For 
the semi-continuous case, we have used the method of choosing the closest point during the course of our algorithm 
described in Section |5] and not the up-sampling approach. Note the regularity of the segments indicating the corre- 
spondences (shown in green) as compared to the discrete setting. This is also a good indication that the variance of 
assigned distances is much smaller and reflects the characteristics of the routes taken by the trajectories. 



Fig.[T3]shows the best of the individual local assignments computed for the pair of trajectories chosen from the bus 
dataset under both the discrete and semi-continuous setting. We chose the parameter r used to decide a lower bound 
on the scores of the local assignment (see Sec.|4]) to be 1.5 A and 2 A respectively for the discrete and semi-continuous 
cases. On the other hand, the local assignment computed in the discrete setting captures a shorter portion where the 
sampling is more uniform. 

Effect of Parameters. Recall that the parameters in the scoring function are all set based on a threshold distance for 
gaps (see Sec. [6]). Fig.[T4]shows how the rms of the optimal assignment varies as a function of the distance threshold 



r 



used to define the parameters of a. For each distance r on the x-axis we defined the parameters a, A, c and computed 



the rms of the optimal assignment based on those parameters. The axes in Fig.[T4]are set at a log-scale due to the much 
larger variation in distances and extremely low distance threshold in the converged assignment. When we choose the 
"correct" r for the algorithm, we expect variation in distances of assignment edges to be only due to noise. Hence, the 
rms of the assignment edges will be comparable to the chosen r (if the actually deviating portions are at sufficiently 
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(a) Discrete (b) Semi-continuous 

Figure 12. Comparison of discrete and semi-continuous assignments. 



(a) Discrete (b) Semi-continuous 

Figure 13. Local assignments. 




10'^ 10'^ 10'^ 

Distance Threshold 



Figure 14. Illustration of the root-mean- square of distances between the pairs in the optimal assignment as a function of the distance threshold 
used to define the parameters of a (cf. Section[6}. The red dots on the curve corresponds to the rms values produced during the iterative parameter 
selection algorithm. 



large distances relative to the noise). This is reflected in the flat portion of the graph in Fig 14 The red points in 



Fig. 14 are the chosen thresholds of the iterative procedure and as can be seen, it converges (leftmost red point) when 
the chosen threshold is comparable to the rms. 

Finally, Fig. [15] shows the assignment and histograms for the pair from the buses and cycling datasets at the 



beginning, end and an intermediate step of the iterative approach. As we can see, larger distances are slowly pruned 
away until we reach the optimal assignment. In the bus dataset, however, there is still significant variance in the data 
at the time of convergence showing the low sampling rate and noise inherent in the data. 



7.3 Results on Complete Datasets 

To further demonstrate the effectiveness of our algorithms, we present here results showing the behavior on the datasets 
as a whole. Based on pairwise correspondences computed by the DTW, DTW-Pruned and Assignment algo- 
rithms, we analyze the importance of a sample point on a trajectory based on its correspondences with all other 
trajectories. Specifically, we wish to measure the importance of a portion of a trajectory with respect to the dataset 
using the importance of individual points on it. In all cases, we compute pairwise correspondences for all pairs of 
trajectories in the dataset. Based on these, we define the importance of a sample point p on some trajectory P in three 
different ways for the three algorithms such that the comparison is as fair as possible: (i) for ASSIGNMENT, we define 
it as the number of outgoing edges for a sample point, i.e., the number of trajectories Q such that for some q ^ Q, 
= (7 in the assignment., (ii) for DTW-Pruned, we define it in a similar manner, the number of trajectories Q 
such that (p, q) for some g G Q is part of the set of correspondences between P and Q, (iii) for DTW, it is not possi- 
ble to define similar to DTW-PRUNED since then, the importance of all points is the same; hence, we define it as the 
number of correspondence pairs that p is part of over all pairwise assignments. For ASSIGNMENT, the outdegree of a 
point is an intuitive way to define the importance since it provides a natural way of defining the number of trajectories 
which share a common portion with P at p. We chose the definition of importance of DTW-PRUNED to reflect the 
same intuition. 

Again, we chose 100 m as the distance threshold for DTW-PRUNED and ASSIGNMENT and 4 as the minimum gap 
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Distance of Correspondences 



Distance of Correspondences 



Distance of Correspondences 



(k) 



(1) 



Figure 15. The assignments and corresponding distance histograms computed at various stages of the iterative algorithm, (a), (b), (c), (d), 
(f) correspond to the buses dataset pair while (g), (h), (i), (j), (k) and (1) correspond to the cycling dataset pair, (a), (d), (g) and (j) are at 
iteration, (b), (e), (h) and (k) are at an intermediate stage and (c), (f), (i) and (1) are at the point of convergence. 



(e) and 
the first 



length for ASSIGNMENT. Figures 16 17 and 18 show the heat maps of the importance of the points belonging to all 
trajectories in the buses, GeoLife and cycling datasets respectively. As is clear, the DTW approach does not provide 
any meaningful results, so we present it for completeness. Comparing DTW-Pruned and ASSIGNMENT, the results 
are quite similar. Both of these seem to do a good job of identifying commonly traveled routes and landmark points 
in the dataset. ASSIGNMENT seems to do a slightly better job of identifying more central points along the routes or 
points on a so-called "mean" trajectory although this is apparent only upon close examination. Further analyzing the 
behavior of trajectories based on the complete dataset is outside the scope of this paper and we present these results 
here as a flavor of what our framework can provide for this purpose. 



7.4 Summary 

We have shown that our framework captures the advantages of both DTW and sequence alignment based approaches 
for identifying trajectory similarity, and that it is able to exceed their accuracy. Experiments show that the approach 
is highly accurate in identifying similar portions of trajectories from real datasets. Further, even without an accurate 
prior knowledge of distances between points based on which to compute similarity, our iterative procedure is able to 
converge at the point where similar portions are identified and distinguished accurately from dissimilar portions. We 
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also show the effectiveness of the semi-continuous setting and that the shift of the score for computing local assignment 
is highly dependent on the variance of distances between similar points. 

Finally, our results on importance of sample points based on pairwise correspondences computed over the entire 
dataset shows that our framework does as well or better than other approaches while provided a principled way to 
compute correspondences and measure similarity between two trajectories. We feel that, for the purposes of analyzing 
complete datasets for highly conserved portions of trajectories and performing clustering of trajectories on this basis, 
our framework of assignments provides a solid foundation to work with. This direction of research certainly seems 
like a rich one to undertake. 
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Figure 16. Heat maps of the importance of sample points for the buses dataset. (d), (e) and (f) show a zoomed in portion of (a), (b) and (c) 
respectively. 
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Figure 17. Heat maps of the importance of sample points for the GeoLife dataset. 




(a) DTW (b) DTW-Pruned (c) Assignment 

Figure 18. Heat maps of the importance of sample points for the cycling dataset. 
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